MultiUN v2: UN Documents with Multilingual Alignments
نویسندگان
چکیده
MultiUN is a multilingual parallel corpus extracted from the official documents of the United Nations. It is available in the six official languages of the UN and a small portion of it is also available in German. This paper presents a major update on the first public version of the corpus released in 2010. This version 2 consists of over 513, 091 documents, including around 9% of new documents retrieved from the United Nations official document system. Compared to the first release, we applied several modifications to the corpus preparation method. In this paper, we describe the methods we used for processing the UN documents and aligning the sentences. The most significant improvement compared to the previous release is the newly added multilingual sentence alignment information. The alignment information is encoded together with the text in XML instead of additional files. Our representation of the sentence alignment allows quick construction of aligned texts parallel in arbitrary number of languages, which is essential for building machine translation systems.
منابع مشابه
MultiUN: A Multilingual Corpus from United Nation Documents
This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set fo...
متن کاملPseudo-Aligned Multilingual Corpora
In machine translation, document alignment refers to finding correspondences between documents which are exact translations of each other. We define pseudo-alignment as the task of finding topical—as opposed to exact—correspondences between documents in different languages. We apply semisupervised methods to pseudo-align multilingual corpora. Specifically, we construct a topicbased graph for ea...
متن کاملHybrid Language Segmentation for Historical Documents
English. Language segmentation, i.e. the division of a multilingual text into monolingual fragments has been addressed in the past, but its application to historical documents has been largely unexplored. We propose a method for language segmentation for multilingual historical documents. For documents that contain a mix of highand low-resource languages, we leverage the high availability of hi...
متن کاملAutomatic Multi-label Subject Indexing in a Multilingual Environment
This paper presents an approach to automatically subject index fulltext documents with multiple labels based on binary support vector machines (SVM). The aim was to test the applicability of SVMs with a real world dataset. We have also explored the feasibility of incorporating multilingual background knowledge, as represented in thesauri or ontologies, into our text document representation for ...
متن کاملNews from OPUS — A Collection of Multilingual Parallel Corpora with Tools and Interfaces
The opus corpus is a growing resource providing various multilingual parallel corpora from different domains. In this article we introduce resources that have recently been added to opus. We also look at some corpus-specific problems and the solutions used in preparing the parallel data for the inclusion in our collection. In particular, we discuss the alignment of movie subtitles and the conve...
متن کامل